Topic Spotting on News Articles with Topic Repository by Controlled Indexing

Authors

  • Taeho Jo
  • Jerry H. Seo
  • Hyeon Kim
Abstract

Topic spotting is the task of assigning one of a set of predefined categories to a document; it is also called text categorization. Controlled indexing is the procedure of extracting from a text the informative terms that reflect its contents. The proposed scheme of topic spotting uses two kinds of repositories: an integrated repository for controlled indexing and a topic repository for topic spotting. Each repository is constructed by learning from texts and consists of terms and their associated information: the total frequency and the IDF (Inverse Document Frequency). An unknown text is represented as a list of informative terms by controlled indexing with reference to the integrated repository, and the category with the largest weight is determined to be the topic (category) of the text. To validate the scheme, news articles from the site "http://www.newspage.com" are used as examples in the experiments of this paper.
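
The scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the tokenizer, the IDF definition (log of the number of texts over the document frequency), the ranking of informative terms by frequency times IDF, the use of one topic repository per category, and the category weight (share of the category's total term frequency covered by the informative terms) are all assumptions made for illustration. The function names and the toy training texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: terms are lowercase alphabetic tokens.
    return re.findall(r"[a-z]+", text.lower())


def build_repository(texts):
    """Learn {term: (total_frequency, idf)} from a list of texts."""
    total, doc_freq = Counter(), Counter()
    for text in texts:
        tokens = tokenize(text)
        total.update(tokens)          # total frequency over all texts
        doc_freq.update(set(tokens))  # number of texts containing the term
    n = len(texts)
    # Assumption: idf = log(n / df); the abstract does not give a formula.
    return {t: (total[t], math.log(n / doc_freq[t])) for t in total}


def controlled_index(text, integrated, k=5):
    """Return up to k informative terms of `text` using the integrated repository."""
    counts = Counter(tokenize(text))
    # Assumption: informativeness = in-text frequency * integrated IDF.
    scores = {t: c * integrated[t][1] for t, c in counts.items() if t in integrated}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [t for t in ranked if scores[t] > 0][:k]


def spot_topic(text, integrated, topic_repos, k=5):
    """Return the category with the largest weight, or None if no term is informative."""
    terms = controlled_index(text, integrated, k)
    if not terms:
        return None
    weights = {}
    for topic, repo in topic_repos.items():
        size = sum(freq for freq, _ in repo.values()) or 1
        # Assumption: weight = share of the topic's term mass covered by the indexed terms.
        weights[topic] = sum(repo[t][0] for t in terms if t in repo) / size
    return max(weights, key=weights.get)


# Hypothetical toy training data, one list of texts per category.
training = {
    "sports": ["the team won the match", "the player scored a goal in the match"],
    "finance": ["the bank raised the interest rate", "the stock market fell as the rate rose"],
}
integrated = build_repository([t for texts in training.values() for t in texts])
topic_repos = {topic: build_repository(texts) for topic, texts in training.items()}

print(controlled_index("the team scored a late goal", integrated))
print(spot_topic("the team scored a late goal", integrated, topic_repos))
```

In this sketch a term that occurs in every training text, such as "the", receives an IDF of zero in the integrated repository and is therefore never selected as an informative term. The IDF stored in each topic repository is kept to mirror the repository structure named in the abstract, but this particular weighting does not use it.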

Similar resources

Topic recognition for news speech based on keyword spotting

This paper describes topic identification for Japanese TV news speech based on the keyword spotting technique. Three thousand nouns are selected as keywords that contribute to topic identification, based on the criteria of mutual information and word length. This set of keywords identified the correct topic for 76.3% of articles from newspaper text data. Further, we performed keywor...

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into predefined categories. There is a lot of news spreading on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

A Neural Network Approach to Topic Spotting

This paper presents an application of nonlinear neural networks to topic spotting. Neural networks allow us to model higher-order interaction between document terms and to simultaneously predict multiple topics using shared hidden features. In the context of this model, we compare two approaches to dimensionality reduction in representation: one based on term selection and another based on Late...

Large-scale Controlled Vocabulary Indexing for Named Entities

A large-scale controlled vocabulary indexing system is described. The system currently covers almost 70,000 named entity topics, and applies to documents from thousands of news publications. Topic definitions are built through substantially automated knowledge engineering.

Detection of Difference between News Articles on the Same Topic Based on Sequential Comparison

Currently, a lot of news articles are published on the Web, and it is getting easier for us to read them. However, the number of articles is too large for us to read all of them. Although some Web sites cluster/classify news articles into topics (categories), this is not enough, since a large number of articles remain in each topic. Detecting the difference between articles on one topic will b...

Publication date: 2000